etcdserver: separate "raft log compact" from snapshot #18235
Conversation
Signed-off-by: Clement <[email protected]>
Hi @clement2026. Thanks for your PR. I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Signed-off-by: Clement <[email protected]>
/retest
server/etcdserver/server.go
```diff
-	compacti := uint64(1)
+	compacti := uint64(0)
 	if snapi > s.Cfg.SnapshotCatchUpEntries {
 		compacti = snapi - s.Cfg.SnapshotCatchUpEntries
 	}
```
Why change from 1 to 0?
https://github.com/etcd-io/raft/blob/ee19bbe7f9bb16fd7d874a6c08dfac319596d2ad/storage.go#L258-L260
```go
if compactIndex > ms.lastIndex() {
	getLogger().Panicf("compact %d is out of bound lastindex(%d)", compactIndex, ms.lastIndex())
}
```
A panic occurred at this line during local test runs. It appears the initial value of ms.lastIndex() is 0, so compacti should default to zero.

On the main branch, the panic doesn't occur because ms always contains more than one element when compaction starts, which isn't the case in this patch.
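For context, here is a minimal runnable sketch of the behaviour described above, assuming the public go.etcd.io/raft/v3 MemoryStorage API (this is not etcd server code): on a freshly created storage lastIndex() is 0, so compacting at index 1 trips the panic, while index 0 merely reports that the index is already compacted.

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
)

func main() {
	ms := raft.NewMemoryStorage() // fresh storage: lastIndex() == 0

	// Compacting at index 0 is harmless: raft just reports ErrCompacted.
	fmt.Println(ms.Compact(0))

	// Compacting at index 1 would exceed lastIndex() and hit the Panicf
	// shown in the snippet above, so it stays commented out here.
	// ms.Compact(1)
}
```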
Great work, the change looks good, but as with any change to the etcd apply loop, we need to be careful here. It would be good to investigate; would you be able to collect CPU & memory profiles?
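As an aside for anyone reproducing this: etcd started with --enable-pprof serves the standard Go /debug/pprof endpoints on its client URL. A minimal sketch of the same mechanism in a plain Go program (hypothetical port, not part of this PR) looks like this.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Any Go process wired like this serves CPU and heap profiles over HTTP,
	// which is essentially what etcd does when --enable-pprof is set.
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}
```

Profiles can then be pulled with go tool pprof, e.g. go tool pprof http://127.0.0.1:6060/debug/pprof/heap for memory, or .../debug/pprof/profile?seconds=30 for CPU.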
Found by robustness tests:
Thanks! I understand the importance of the apply loop. I'll investigate it and provide CPU & memory profiles after resolving the bug and test failures.
Signed-off-by: Clement <[email protected]>
/retest
Thanks for fixing the test. Please note that although this PR is small, it might have a big impact on etcd behavior, so its review might take a while. It would be great to repeat the tests and compare the profiles to understand the performance difference. cc @ahrtr
Thanks for checking in. I temporarily fixed the test to try out my ideas, but it might cause a serious memory leak. Currently, I'm developing an optimized version that properly fixes the test. I’ll update you once it's done and provide the CPU and memory profiles.
…all ongoing snapshots Signed-off-by: Clement <[email protected]>
abcae5f to 93d0484
/retest
Signed-off-by: Clement <[email protected]>
adb970c to 302254b
/retest
Signed-off-by: Clement <[email protected]>
I have completed some benchmarks. Here’s the heatmap comparison between this PR and the main branch:

Since the full benchmark is time-consuming, the results were generated using this simplified script:

```sh
export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
./rw-benchmark.sh
```

The script was executed on a Vultr cloud VM with 8 vCPUs, 32 GB of memory, and a 150 GB NVMe drive. All subsequent results are based on this machine configuration unless otherwise specified.

Updates:
Benchmark etcd using a small CONN_CLI_COUNT, a small VALUE_SIZE, and a big RUN_COUNT (1,000,000). The benchmark took about 600 s.

Images produced by executing
Compared to the main branch, this patch uses more memory and consumes less CPU. Below are the steps to reproduce.

Step 1: Prepare ENV

```sh
ETCD_BIN="./bin/etcd"
ETCD_BM_BIN="./bin/tools/benchmark"
BACKEND_SIZE="$((20 * 1024 * 1024 * 1024))" # 20GB
KEY_SPACE_SIZE="$((1024 * 64))"
RATIO="4.0"
KEY_SIZE="256"
RUN_COUNT="1000000"
RANGE_RESULT_LIMIT="100"
VALUE_SIZE="$((2 ** 10))" # VALUE_SIZE="$((2 ** 14))"
CONN_CLI_COUNT="$((2**5))"
```

Step 2: start etcd with pprof enabled

```sh
${ETCD_BIN} --quota-backend-bytes=${BACKEND_SIZE} \
  --log-level 'error' \
  --enable-pprof
```

Step 3: init etcd with data

```sh
${ETCD_BM_BIN} put --sequential-keys \
  --key-space-size=${KEY_SPACE_SIZE} \
  --val-size=${VALUE_SIZE} --key-size=${KEY_SIZE} \
  --total=${KEY_SPACE_SIZE} \
  &>/dev/null
```

Step 4: capture heap usage

Step 6: benchmark

```sh
${ETCD_BM_BIN} txn-mixed "" \
  --conns=${CONN_CLI_COUNT} --clients=${CONN_CLI_COUNT} \
  --total=${RUN_COUNT} \
  --rw-ratio ${RATIO} --limit ${RANGE_RESULT_LIMIT} \
  --val-size ${VALUE_SIZE}
```

Step 7: capture heap usage
Benchmark etcd using a big CONN_CLI_COUNT, a big VALUE_SIZE, and a big RUN_COUNT (1,000,000). The benchmark took about 1100 s.

Images produced by executing

Compared to the main branch, this patch uses less memory and consumes less CPU. The steps to reproduce are nearly identical to the previous benchmark description, except for the different ENV settings.

Step 1: Prepare ENV

```sh
ETCD_BIN="./bin/etcd"
ETCD_BM_BIN="./bin/tools/benchmark"
BACKEND_SIZE="$((20 * 1024 * 1024 * 1024))" # 20GB
KEY_SPACE_SIZE="$((1024 * 64))"
RATIO="4.0"
KEY_SIZE="256"
RUN_COUNT="1000000"
RANGE_RESULT_LIMIT="100"
VALUE_SIZE="$((2 ** 14))"
CONN_CLI_COUNT="$((2**11))"
```
I am still struggling to understand why this patch reduced the
@serathius
@clement2026 great work; it's a lot of information and it should be enough to draw some conclusions. I think what we are missing is a clear summarization of your findings: what tests were done, and what improvement/regression each test showed. Please see kubernetes/enhancements#2340 (comment) for an example.
Thanks for the example. I’ll create a summary of the findings, including the tests conducted and their results.
@serathius Here’s a summary of the three tests I’ve done. Please review it when you have a moment.

Test 1: #18235 (comment)
Improvements

Test 2: #18235 (comment)
Improvements

Test 3 (1 KB): #18235 (comment)
Improvements
Degradations

Updates:
I find testing and recording the memory usage of etcd to be quite time-consuming. We already have rw-benchmark.sh for collecting read/write performance data. Are there similar scripts for gathering memory usage statistics?
I feel your pain; it shouldn't take so much time to iterate to validate a performance improvement. There is an open issue to improve the current situation, but there hasn't been a lot of progress: #16467

The change looks overall good to me. Usage of a heap might be overkill here; I don't expect too many concurrent snapshots to be executing, however users might set the snapshot count very low, which for a big db file could cause many concurrent snapshot attempts.

The improvement in performance is awesome! I don't think the increase in memory for small objects is surprising or concerning.

Executing compaction after each apply might be overkill. Would you want to try checking whether reducing the frequency to once every 100/1000 would be better?

Happy to see the change passes the robustness test. We might also want to think about whether we could add some more testing to ensure this change is correct.

cc @ahrtr to confirm change correctness and performance benefits.
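To illustrate the frequency suggestion (a hypothetical sketch, not code from this PR; the threshold of 1000 is only an example), gating compaction on the number of applied entries could look like:

```go
package main

import "fmt"

const compactEveryNApplies = 1000 // illustrative threshold, not a tuned value

// compactGate triggers raft log compaction at most once every
// compactEveryNApplies applied entries instead of after every apply.
type compactGate struct {
	lastCompactedAt uint64 // applied index at which we last compacted
}

// shouldCompact reports whether enough entries have been applied since the
// last compaction, and records the new position when it returns true.
func (g *compactGate) shouldCompact(appliedIndex uint64) bool {
	if appliedIndex < g.lastCompactedAt+compactEveryNApplies {
		return false
	}
	g.lastCompactedAt = appliedIndex
	return true
}

func main() {
	g := &compactGate{}
	for i := uint64(1); i <= 2500; i++ {
		if g.shouldCompact(i) {
			// The real server would call raftStorage.Compact(...) here.
			fmt.Println("compact at applied index", i) // fires at 1000 and 2000
		}
	}
}
```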
This is great advice. I'll reduce the compaction frequency to once every 100/1000 runs and observe the benchmark results. With such a low frequency, replacing the min heap with a simple slice should not impact performance and will improve readability and maintainability.
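For illustration only (hypothetical names, not the PR's actual implementation), the slice-based alternative to the min heap might look like the following: track the indexes of in-flight snapshots and never compact past the smallest one. With at most a handful of concurrent snapshots, the linear scan is effectively free.

```go
package main

import "fmt"

// snapshotTracker records the raft indexes of snapshots that are still being
// written, so log compaction never drops entries an ongoing snapshot needs.
type snapshotTracker struct {
	inflight []uint64 // indexes of ongoing snapshots; expected to stay tiny
}

func (t *snapshotTracker) add(index uint64) { t.inflight = append(t.inflight, index) }

func (t *snapshotTracker) remove(index uint64) {
	for i, v := range t.inflight {
		if v == index {
			t.inflight = append(t.inflight[:i], t.inflight[i+1:]...)
			return
		}
	}
}

// minInflight returns the smallest in-flight snapshot index; ok is false when
// no snapshot is in progress and compaction is only bounded by the applied index.
func (t *snapshotTracker) minInflight() (min uint64, ok bool) {
	for i, v := range t.inflight {
		if i == 0 || v < min {
			min = v
		}
	}
	return min, len(t.inflight) > 0
}

func main() {
	t := &snapshotTracker{}
	t.add(120)
	t.add(95)
	if m, ok := t.minInflight(); ok {
		fmt.Println("do not compact past index", m) // 95
	}
	t.remove(95)
}
```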
Thanks for the PR, but I think the complexity introduced in this PR outweighs the benefit. We should always consider simplifying the core apply workflow, so as to improve understandability and maintainability.

Previously, creating a snapshot was performed asynchronously in a separate goroutine (see the link below), because the v2 snapshot contained all key data, so it might be huge and accordingly could take a long time to create. But that isn't the case anymore; v2store & the v2 snapshot only contain membership data now.

etcd/server/etcdserver/server.go Line 2129 in 1b25193

So I think we can start with turning it into a sync operation, and a direct & immediate benefit is a big simplification. But we need to evaluate the impact on performance. Afterwards, we can discuss any further improvements step by step.
I wouldn't phrase it as "complexity outweighs benefits", however I agree that we should consider simplifying the code first. It's a good suggestion to start by making the operation synchronous. @clement2026 Could you prepare a separate PR that changes the snapshot + compact into a sync operation within the apply loop and see how this impacts performance? Removing GoAttach should be enough. Please let me know if you need any help.
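Roughly what the sync change means (a simplified, self-contained sketch with stand-in types; GoAttach here only mimics the idea of etcd's tracked goroutines, and the snapshot/compact methods are placeholders):

```go
package main

import (
	"fmt"
	"sync"
)

// server is a stand-in for the etcd server; GoAttach mimics running a task
// in a tracked goroutine the way etcd's (*EtcdServer).GoAttach does.
type server struct{ wg sync.WaitGroup }

func (s *server) GoAttach(f func()) {
	s.wg.Add(1)
	go func() { defer s.wg.Done(); f() }()
}

func (s *server) snapshot(appliedIndex uint64) { fmt.Println("snapshot at", appliedIndex) }
func (s *server) compact(appliedIndex uint64)  { fmt.Println("compact at", appliedIndex) }

func main() {
	s := &server{}

	// Before (simplified): snapshot + compact run in a separate goroutine and
	// can overlap with later applies.
	s.GoAttach(func() {
		s.snapshot(100)
		s.compact(100)
	})
	s.wg.Wait()

	// After (simplified): run them synchronously inside the apply loop, which
	// is acceptable now that the v2 snapshot only holds membership data.
	s.snapshot(200)
	s.compact(200)
}
```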
@ahrtr @serathius Wow, that sounds great! Converting snapshot and compact into sync operations should significantly simplify things. I'll start with the sync operation in a new PR and see how it goes. If everything works well, I'll rewrite this PR based on that one.
Out of curiosity, @clement2026, what parameters did you use to run your benchmarks? And how long does it take to run them?
@ivanvc Initially, I ran the script with the default parameters, but it didn't complete overnight. I then terminated the script and adjusted the parameters to ensure it would finish within a limited timeframe. With the modified parameters, the script completed in a few hours (I did not record the exact time):

```sh
export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
./rw-benchmark.sh
```

I ran the same script twice last night to benchmark another PR, taking a total of 4 hours and 13 minutes (approximately 2 hours and 6 minutes per run). FYI: #18283 (comment)
I think this PR demonstrates the great potential benefits of this idea. Should the change be guarded by a feature gate to be safe?
@clement2026 With #18283 merged, what would be the best way to move forward? Is this PR still relevant? cc @ahrtr
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Based on #18283, we can implement this feature with less code and make it simpler. So, we don't need this PR anymore. I’ll create a new PR to keep things moving forward. Thanks, everyone, for taking the time to review it! Closing this.
I'm looking into #17098 @serathius and made changes to separate the compact logic from the snapshot logic.

I ran rw-benchmark.sh several times to observe performance changes. I expected this patch to slow down etcd, but it seems to have improved rw-benchmark.sh results. I’m not sure how. Similar results were observed on both cloud VMs (8 vCPU, 16 GB memory, NVMe) and a bare-metal machine (8 CPU, 32 GB memory, SSD).

Since the rw-benchmark.sh script was taking forever to finish, I adjusted the parameters to complete the benchmark within a few hours. Below are the script and the results from running it on a Vultr bare-metal machine (8 CPUs, 32 GB memory, SSD, Ubuntu 24.04 LTS x64).

main.csv
issue-17098-patch.csv